
Conversation

@mydatascience
Collaborator

Description

Refactors GRPO and adds unified functionality that makes it easy to add new models.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

- `grpo_llama3_1_8b_demo_pw.py` - Pathways-based 8B model
- `grpo_llama3_1_70b_demo_pw.py` - Pathways-based 70B model

These have been consolidated into a single **unified CLI script** (`grpo_demo.py`) that works with the new **grpo.yml** configuration file.
Collaborator

again: should this be "demo"?

To me, "demo" suggests it may not be suitable for production workloads.

@github-actions

🤖 Hi @A9isha, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Collaborator

Should we move evaluate_rl.py, rl_utils.py, and train_rl.py into src/MaxText/rl?

Collaborator

+1

# ====== Debug flag for verbose logs ======
DEBUG = tmvp_config.debug

print("Starting GRPO Training")
Collaborator

Use max_logging.log instead of print.
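The swap the comment asks for might look like the stand-in sketch below; treating max_logging.log as a thin wrapper over Python's logging module is an assumption, so this is illustrative rather than the real MaxText helper:

```python
import logging

# Stand-in for MaxText's max_logging.log (assumed to wrap a module-level
# logger); the point is to route status messages through logging, which
# honors handlers and levels, instead of bare print().
logger = logging.getLogger("grpo_demo")


def log(msg):
  """Emit an info-level message through the shared logger."""
  logger.info(msg)


log("Starting GRPO Training")
```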

os.makedirs(data_dir)

data = tfds.data_source(
"gsm8k",
Collaborator

Dataset name should come from tmvp_config.
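A resolution helper along these lines would implement the suggestion; note that a dataset_name field on tmvp_config is the reviewer's proposal, not an existing MaxText option, so both the field and the helper are hypothetical:

```python
from types import SimpleNamespace

# Hypothetical sketch: dataset_name on tmvp_config is an assumption taken
# from the review comment; "gsm8k" stays as the fallback so current
# behavior is preserved when the field is absent or empty.
def resolve_dataset_name(config, default="gsm8k"):
  """Return the configured TFDS dataset name, or the GSM8K default."""
  return getattr(config, "dataset_name", None) or default


# The result would replace the hard-coded "gsm8k" literal passed to
# tfds.data_source(...).
demo_config = SimpleNamespace(dataset_name="gsm8k")
```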


# ====== Data ======
# Setup data directories
home = os.path.expanduser("~") + "/"
Collaborator

If we switch to the Hugging Face API to load the dataset, we won't need to set up these data directories; Hugging Face downloads the data into a cache. Not sure if TFDS can also do that.
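One way to keep the script testable while moving to Hugging Face loading is to inject the loader callable; datasets.load_dataset caches downloads itself (under the local HF cache by default), which is the behavior the comment refers to. The helper below is a hypothetical sketch, not existing MaxText code:

```python
def load_gsm8k_splits(loader, name="gsm8k", config_name="main"):
  """Fetch the train/test splits through a Hugging Face-style loader.

  In the demo script, `loader` would be datasets.load_dataset, which
  manages its own download cache, so no manual data_dir setup (or
  os.makedirs bookkeeping) is needed. Injecting the loader keeps this
  sketch runnable without network access.
  """
  train = loader(name, config_name, split="train")
  test = loader(name, config_name, split="test")
  return train, test
```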

os.makedirs(test_data_dir)

# Create model tokenizer
model_tokenizer = AutoTokenizer.from_pretrained(tmvp_config.hf_model_name)
Collaborator

base.yml has tokenizer_path that can be used here, right?
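Assuming tmvp_config picks up tokenizer_path from base.yml as the comment suggests, the resolution could be sketched like this; the field names and fallback policy here are assumptions, not confirmed MaxText behavior:

```python
from types import SimpleNamespace

# Sketch: prefer an explicit tokenizer_path (assumed to come from base.yml)
# and fall back to hf_model_name, which preserves the current behavior of
# AutoTokenizer.from_pretrained(tmvp_config.hf_model_name) when no
# tokenizer path is configured.
def resolve_tokenizer_source(config):
  """Return the identifier to hand to AutoTokenizer.from_pretrained."""
  return getattr(config, "tokenizer_path", "") or config.hf_model_name
```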

Collaborator

@xuefgu left a comment

Thanks for the PR @A9isha !

In addition to the comments: can you please clarify what tests you performed, i.e., the hardware, the model, and, most importantly, the configs (since those are the main change here)?

Collaborator

+1

Comment on lines +52 to +112
# # Install vLLM for Jax and TPUs from the artifact registry
# RUN VLLM_TARGET_DEVICE="tpu" pip install --no-cache-dir --pre \
# --index-url https://us-python.pkg.dev/cloud-tpu-images/maxtext-rl/simple/ \
# --extra-index-url https://pypi.org/simple/ \
# --extra-index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ \
# --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
# --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html \
# --find-links https://storage.googleapis.com/libtpu-wheels/index.html \
# --find-links https://storage.googleapis.com/libtpu-releases/index.html \
# --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html \
# --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html \
# vllm==0.11.1rc1.dev292+g1b86bd8e1.tpu

# # Install tpu-commons from the artifact registry
# RUN pip install --no-cache-dir --pre \
# --index-url https://us-python.pkg.dev/cloud-tpu-images/maxtext-rl/simple/ \
# --extra-index-url https://pypi.org/simple/ \
# --extra-index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ \
# --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html \
# tpu-commons==0.1.2

# # Uninstall existing jax to avoid conflicts
# # RUN pip uninstall -y jax jaxlib libtpu

# # --- STAGE 1: Install Static Dependencies ---
# # Install any packages *not* defined in your project dependency files
# RUN --mount=type=cache,target=/root/.cache/pip pip install \
# aiohttp==3.12.15\
# keyring \
# keyrings.google-artifactregistry-auth

# RUN --mount=type=cache,target=/root/.cache/pip pip install \
# numba==0.61.2

# # RUN VLLM_TARGET_DEVICE="tpu" pip install vllm
# # --- STAGE 2: Install Project Dependencies (The Main Cached Layer) ---

# # Copy *only* the dependency definition files.
# # This assumes vllm and tpu-inference are in the build context, copied from the parent directory.
# COPY vllm/requirements/tpu.txt /tmp/
# COPY vllm/requirements/build.txt /tmp/
# COPY vllm/requirements/common.txt /tmp/
# COPY tpu-inference/requirements.txt /tmp/

# # Run the full dependency installation.
# # This entire layer is cached and will *only* be rebuilt if
# # these .txt files change.
# RUN --mount=type=cache,target=/root/.cache/pip bash -c ' \
# # Set the target device so pip installs the right JAX/libtpu
# # Install tpu-inference dependencies
# export VLLM_TARGET_DEVICE="tpu" && \
# pip install -r /tmp/tpu.txt -r /tmp/build.txt -r /tmp/common.txt -r /tmp/requirements.txt --no-cache-dir --pre \
# --extra-index-url https://pypi.org/simple/ \
# --extra-index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ \
# --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
# --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html \
# --find-links https://storage.googleapis.com/libtpu-wheels/index.html \
# --find-links https://storage.googleapis.com/libtpu-releases/index.html \
# --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html \
# --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html'

Collaborator

Please clean up this large block of commented code. Some of it is no longer relevant.

os.environ.get("LIBTPU_INIT_ARGS", "") + " --xla_tpu_spmd_rng_bit_generator_unsafe=true"
)

tmvp_config = pyconfig.initialize(argv)
Collaborator

Just use config as the variable name?

Collaborator

I think tmvp_config helps convey that all the configs are gathered together here

Comment on lines +129 to +142
num_trainer_devices = int(num_devices * tmvp_config.trainer_devices_fraction)
num_sampler_devices = int(num_devices * tmvp_config.sampler_devices_fraction)
Collaborator

In config, should we add a check for "if using pathways, trainer_devices_fraction + sampler_devices_fraction should not exceed 1"? I find the behavior hard to reason about if the sum is larger than 1 for disaggregated RL.

Collaborator

we do allow trainer_devices_fraction = sampler_devices_fraction = 1.0, where the full mesh is used for both training and inference, i.e., colocated rather than disaggregated, but still multihost
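The check suggested above could be sketched as follows; the function name and the explicit disaggregated flag are illustrative, not existing MaxText config fields:

```python
# Hypothetical validation sketch based on the review thread: disaggregated
# runs must not over-subscribe the mesh, while the colocated case (both
# fractions 1.0, sharing the full mesh) stays allowed.
def check_device_fractions(trainer_fraction, sampler_fraction, disaggregated):
  """Raise ValueError for fraction pairs that cannot be scheduled."""
  for name, frac in (("trainer", trainer_fraction), ("sampler", sampler_fraction)):
    if not 0.0 < frac <= 1.0:
      raise ValueError(f"{name}_devices_fraction must be in (0, 1], got {frac}")
  if disaggregated and trainer_fraction + sampler_fraction > 1.0:
    raise ValueError(
        "with disaggregated RL, trainer_devices_fraction + "
        "sampler_devices_fraction must not exceed 1.0"
    )
```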

# Load policy model
print("Creating policy model with same config as reference model on trainer mesh")
policy_model, policy_mesh = get_maxtext_model(tmvp_config, trainer_devices)
actor_mesh = policy_mesh
Collaborator

For all intents and purposes, we don't need actor_mesh and can just use policy_mesh in line 317.

Or, we could call the vars actor_model and actor_mesh in line 262. I prefer this.

The point is that there is no material difference between "actor" and "policy" in this context, so distinguishing them is confusing.


# ====== System prompt and Templates ======

system_prompt: |
Collaborator

This is dataset-specific and can be moved to the examples script.
I recently added a templates folder in MaxText that contains this template for the GSM8K dataset; we can use that.
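Reading the prompt from a shared templates folder might look like the sketch below; the folder layout and file-naming scheme are assumptions based on the comment, not the actual MaxText templates organization:

```python
from pathlib import Path

# Hypothetical helper: a one-file-per-dataset layout under the templates
# folder is assumed here; the real MaxText templates folder may organize
# prompts differently.
def load_system_prompt(templates_dir, dataset="gsm8k"):
  """Read a dataset's system prompt from the shared templates folder."""
  prompt_file = Path(templates_dir) / f"{dataset}_system_prompt.txt"
  return prompt_file.read_text(encoding="utf-8").strip()
```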

from MaxText import rl_utils


# We use OpenAI's GSM8K dataset. GSM8K comprises grade school math word problems.
Collaborator

The comment at line 46 mentions that this file can also be used to run GRPO on a custom dataset. Can we move all the GSM8K-related code to examples?

mydatascience and others added 20 commits October 31, 2025 22:09
Signed-off-by: Vladimir Suvorov <[email protected]>


6 participants